Multivariate Data Embeddings
- Generate Multivariate Time Series Data: Generate time series data for multiple variables, such as temperature, humidity, wind speed, and atmospheric pressure.
- Embed Data Using a Pre-trained Model: Use an open-source library such as sentence-transformers to embed the data. Since time series data typically isn't natural language text, it must first be flattened or otherwise represented in a form suitable for embedding.
  - The SentenceTransformer model encodes each row of data into an embedding. Since the model is designed for text, each row is represented as a string that combines the date and the value of each variable.
- Store the Embeddings in a FAISS Vector Database: Store the generated embeddings in FAISS to allow efficient similarity search.
- Implement RAG for Querying: Allow querying the vector database based on user input, such as a query about a specific variable (e.g., temperature or wind speed). This notebook implements only the retrieval step.
  - Query and Retrieval: When a user provides a query (e.g., "temperature: 25"), we encode the query and perform a nearest-neighbor search in the FAISS index to retrieve the top 3 most similar rows based on embedding distance.
- Visualize Results: Display the original and predicted (retrieved) data, including embedding distance, as markdown tables, and use a radar chart for visualization.
  - Original Data: The original multivariate time series data is displayed as a markdown table.
  - Predicted Data: The predicted data (the top 3 most similar rows) is displayed as a markdown table with embedding distances.
  - Radar Chart: A radar chart is shown for the top 3 predicted rows, visualizing the relative values of the features.
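
The retrieval step described above is a nearest-neighbor search over embedding vectors. The sketch below reproduces that idea in plain NumPy rather than through the FAISS API, so it runs without a model or an index. The helper name `brute_force_l2_search` and the toy vectors are illustrative assumptions, not part of the notebook. It returns squared Euclidean distances, which is also what a FAISS `IndexFlatL2` reports, so values are larger than plain Euclidean distances would be.

```python
import numpy as np

def brute_force_l2_search(vectors, query, k=3):
    """Return indices and squared L2 distances of the k vectors nearest to query.

    Hypothetical helper for illustration only; it is not a FAISS function.
    """
    vectors = np.asarray(vectors, dtype="float32")
    query = np.asarray(query, dtype="float32")
    diffs = vectors - query                   # broadcast query against every row
    sq_dists = np.sum(diffs * diffs, axis=1)  # squared Euclidean distance per row
    order = np.argsort(sq_dists)[:k]          # smallest distance first
    return order, sq_dists[order]

# Toy 2-dimensional "embeddings" (made-up values, not model output)
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_query = np.array([0.9, 0.1])

idx, dist = brute_force_l2_search(toy_vectors, toy_query, k=3)
print("nearest indices:", idx)
print("squared distances:", dist)
```

Results come back ordered from the closest match to the farthest, so the first index is the best match.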
In [ ]:
%pip install -q faiss-cpu sentence-transformers pandas numpy matplotlib scikit-learn tabulate
Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
petastorm 0.12.1 requires pyspark>=2.1.0, which is not installed.
databricks-feature-store 0.14.3 requires pyspark<4,>=3.1.2, which is not installed.
ydata-profiling 4.2.0 requires numpy<1.24,>=1.16.0, but you have numpy 2.1.3 which is incompatible.
ydata-profiling 4.2.0 requires scipy<1.11,>=1.4.1, but you have scipy 1.14.1 which is incompatible.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 2.1.3 which is incompatible.
mleap 0.20.0 requires scikit-learn<0.23.0,>=0.22.0, but you have scikit-learn 1.1.1 which is incompatible.
langchain 0.0.217 requires numpy<2,>=1, but you have numpy 2.1.3 which is incompatible.
databricks-feature-store 0.14.3 requires numpy<2,>=1.19.2, but you have numpy 2.1.3 which is incompatible.
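
Because the embedding model expects text, each numeric row has to be serialized to a string first. The sketch below shows one way to do that using only the standard library. The helper `row_to_text` and its `ndigits` rounding parameter are assumptions made for illustration; the notebook itself formats the raw values without rounding. Rounding is shown only as one possible way to keep the strings short.

```python
from datetime import date

def row_to_text(row, ndigits=2):
    """Serialize a mapping of column -> value as 'name: value' pairs.

    Hypothetical helper; floats are rounded to ndigits (an assumption,
    the notebook keeps full precision).
    """
    parts = []
    for key, value in row.items():
        if isinstance(value, float):
            value = round(value, ndigits)
        parts.append(f"{key}: {value}")
    return " ".join(parts)

# Values taken from the 2024-01-02 row of the weather table
row = {
    "date": date(2024, 1, 2),
    "temperature": 25.7278,
    "humidity": 87.8198,
    "wind_speed": 1.30694,
    "pressure": 1012.3,
}
text = row_to_text(row)
print(text)
```

A query such as "temperature: 25" uses the same `name: value` pattern, which is what lets the text model place it near rows with similar wording.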
In [ ]:
import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
# Step 1: Generate the multivariate time series data
np.random.seed(0) # For reproducibility
dates = pd.date_range(start='2024-01-01', periods=7, freq='D')
temperature = np.random.uniform(15, 30, size=7) # Temperature in °C
humidity = np.random.uniform(30, 90, size=7) # Humidity in %
wind_speed = np.random.uniform(0, 15, size=7) # Wind speed in km/h
pressure = np.random.uniform(980, 1050, size=7) # Atmospheric pressure in hPa
# Combine into a DataFrame
data = {
'date': dates,
'temperature': temperature,
'humidity': humidity,
'wind_speed': wind_speed,
'pressure': pressure
}
df = pd.DataFrame(data)
# Step 2: Embedding the multivariate data using SentenceTransformer
def embed_data(df):
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')  # Using a pre-trained model for general embeddings
    # Create a string representation of each row to feed into the model
    texts = df.apply(
        lambda row: f"date: {row['date']} temperature: {row['temperature']} "
                    f"humidity: {row['humidity']} wind_speed: {row['wind_speed']} "
                    f"pressure: {row['pressure']}",
        axis=1,
    ).tolist()
    # Create embeddings (one vector per row)
    embeddings = model.encode(texts)
    return embeddings
# Embedding the data
embeddings = embed_data(df)
# Step 3: Store embeddings in FAISS vector database
def store_embeddings(embeddings):
    embeddings = np.array(embeddings).astype('float32')  # FAISS expects float32 vectors
    index = faiss.IndexFlatL2(embeddings.shape[1])  # L2 distance metric
    index.add(embeddings)
    return index
# Store the embeddings in the FAISS index
index = store_embeddings(embeddings)
# Step 4: Retrieval-Augmented Generation (RAG) System to query embeddings
def retrieve_similar_data(query, index, k=3):
    # The query must be encoded with the same model used for the stored rows
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    query_embedding = model.encode([query]).astype('float32')
    # D holds squared L2 distances, I holds indices of the nearest neighbors (closest first)
    D, I = index.search(query_embedding, k)
    return I[0], D[0]
# Step 5: Display the original and predicted data
# Create a DataFrame for embeddings
original_embedding_df = pd.DataFrame(embeddings, columns=[f"dim_{i+1}" for i in range(embeddings.shape[1])])
original_embedding_len_df = len(original_embedding_df.columns)
original_embedding_row_count_df = original_embedding_df.shape[0]
# Select the first 5 columns
original_embedding_5_df = original_embedding_df.iloc[:,:5].head(5)
# Display the original multivariate data as markdown table
original_df = df.head(7)
print("\nMultivariate Time Series Weather Data:\n", original_df.to_markdown(tablefmt="github", index=False))
# Display the original multivariate data embeddings as markdown table
print("\nMultivariate Data Embeddings: (first 5 dimensions)")
print(f"For 7 rows and 5 columns of Multivariate Data {original_embedding_len_df} vectors dimensions were created\n")
print(original_embedding_5_df.to_markdown(index=False)) # Display the embeddings in markdown table format
# Query input: Let's query by a specific variable
user_query = "temperature: 25" # "humidity: 77" or "temperature: 25"
print(f"\nUser Query: {user_query}")
# Retrieve top 3 best matches based on the query
indices, distances = retrieve_similar_data(user_query, index)
# Get the top 3 predicted data
predicted_df = df.iloc[indices].reset_index(drop=True)
predicted_df['embedding_distance'] = distances
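# ascending=False lists the largest distance (the least similar of the k matches) first;
# use ascending=True to rank the closest match first
predicted_df = predicted_df.sort_values(by='embedding_distance', ascending=False).reset_index(drop=True)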
print("\nPredicted Multivariate Data:\n", predicted_df.to_markdown(tablefmt="github", index=False))
# Step 6: Visualize predicted data as a multi-group spider (radar) chart
def plot_multi_group_spider_chart(data, title):
    categories = ['temperature', 'humidity', 'wind_speed', 'pressure']
    # Normalize the data to [0, 1] range for radar chart
    scaler = MinMaxScaler()
    values = scaler.fit_transform(data[categories].values)
    # Setup the angles for the radar chart
    num_vars = len(categories)
    angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
    angles += angles[:1]  # Close the loop
    fig, ax = plt.subplots(figsize=(6, 6), dpi=80, subplot_kw=dict(polar=True))  # figsize=(8, 8)
    # Plot each group (predicted data rows); assumes data has a 0..n-1 index
    for i, row in data.iterrows():
        row_values = np.concatenate([values[i], values[i][:1]])  # Close the loop for the group
        ax.fill(angles, row_values, alpha=0.25)
        ax.plot(angles, row_values, label=f'Prediction {i+1}', linewidth=2)
    # Set the ticks and labels for the axes
    ax.set_yticklabels([])  # Hide radial labels
    ax.set_xticks(angles[:-1])  # Set the x-ticks to be the categories
    ax.set_xticklabels(categories, fontsize=12)  # Set the labels of each axis
    # Add a legend
    ax.legend(loc='upper right', bbox_to_anchor=(1.2, 1.1), fontsize=12)
    plt.title(title, size=14)
    plt.show()
# Display multi-group spider chart for the top 3 predicted data rows
plot_multi_group_spider_chart(predicted_df, "Predicted Multivariate Data")
Multivariate Time Series Weather Data:
| date                | temperature   | humidity   | wind_speed   | pressure   |
|---------------------|---------------|------------|--------------|------------|
| 2024-01-01 00:00:00 | 23.2322       | 83.5064    | 1.06554      | 1035.94    |
| 2024-01-02 00:00:00 | 25.7278       | 87.8198    | 1.30694      | 1012.3     |
| 2024-01-03 00:00:00 | 24.0415       | 53.0065    | 0.303276     | 1034.64    |
| 2024-01-04 00:00:00 | 23.1732       | 77.5035    | 12.4893      | 988.279    |
| 2024-01-05 00:00:00 | 21.3548       | 61.7337    | 11.6724      | 1024.79    |
| 2024-01-06 00:00:00 | 24.6884       | 64.0827    | 13.0502      | 990.035    |
| 2024-01-07 00:00:00 | 21.5638       | 85.5358    | 14.6793      | 1046.13    |

Multivariate Data Embeddings: (first 5 dimensions)
For 7 rows and 5 columns of Multivariate Data 384 vectors dimensions were created

|     dim_1 |    dim_2 |    dim_3 |    dim_4 |    dim_5 |
|----------:|---------:|---------:|---------:|---------:|
| -0.368792 | 0.408502 | 0.588999 | 0.682211 | 0.617992 |
| -0.388623 | 0.412094 | 0.581821 | 0.682733 | 0.61162  |
| -0.436087 | 0.447782 | 0.602891 | 0.715155 | 0.631992 |
| -0.432618 | 0.448834 | 0.549173 | 0.673219 | 0.551741 |
| -0.392174 | 0.364377 | 0.567262 | 0.674651 | 0.493109 |

User Query: temperature: 25

Predicted Multivariate Data:
| date                | temperature   | humidity   | wind_speed   | pressure   | embedding_distance   |
|---------------------|---------------|------------|--------------|------------|----------------------|
| 2024-01-02 00:00:00 | 25.7278       | 87.8198    | 1.30694      | 1012.3     | 49.7495              |
| 2024-01-07 00:00:00 | 21.5638       | 85.5358    | 14.6793      | 1046.13    | 49.5342              |
| 2024-01-03 00:00:00 | 24.0415       | 53.0065    | 0.303276     | 1034.64    | 48.7736              |
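
The radar chart scales each feature to the [0, 1] range using only the retrieved rows, so it shows how those rows compare with each other rather than with the full dataset. The sketch below illustrates that normalization in plain NumPy on the three retrieved rows from the table above. The helper `min_max_scale` is a hypothetical stand-in written for illustration, not the scikit-learn `MinMaxScaler` the notebook uses.

```python
import numpy as np

def min_max_scale(values):
    """Scale each column to [0, 1] using that column's own min and max.

    Hypothetical helper for illustration; constant columns are mapped to 0.
    """
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return (values - col_min) / col_range

# temperature, humidity, wind_speed, pressure for the three retrieved rows
retrieved = np.array([
    [25.7278, 87.8198, 1.30694, 1012.3],
    [21.5638, 85.5358, 14.6793, 1046.13],
    [24.0415, 53.0065, 0.303276, 1034.64],
])

scaled = min_max_scale(retrieved)
# Repeat the first value at the end so the polygon closes on the radar chart
closed = np.concatenate([scaled[0], scaled[0][:1]])
print(np.round(scaled, 3))
```

In every column the smallest retrieved value maps to 0 and the largest to 1. A row that touches the outer edge on one axis therefore has the highest value among the three retrieved rows, not necessarily a high value in absolute terms.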